Construction and selection of a finite mixture model for use in clustering and vector quantization

ABSTRACT

A model selection process for selecting a finite mixture model that is suitable for a given task is described. A set of finite mixture models (FMMs) is produced from a set of parameter values and training data for a given task using an Expectation Maximization (EM) process. An FMM from the set of FMMs is selected based on a minimum description length (MDL) value calculated for each of the FMMs.

RELATED APPLICATIONS

[0001] This application claims the benefit of U.S. Provisional Application No. 60/463,618, filed Apr. 16, 2003.

TECHNICAL FIELD

[0002] This invention relates generally to pattern discovery systems, and more particularly to construction and selection of a finite mixture model for use in clustering and vector quantization over multi-dimensional, real-space data.

COPYRIGHT NOTICE/PERMISSION

[0003] A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever. The following notice applies to the software and data as described below and in the drawings hereto: Copyright © 2003, Sony Electronics, Inc., All Rights Reserved.

BACKGROUND

[0004] Clustering aims at organizing or indexing a large dataset in the form of a collection of patterns or clusters based on similarity. Each cluster corresponds to a subset of the original dataset. From a probabilistic point of view, this means finding the unknown distribution of datum vectors in a given dataset, where the distribution pattern in each cluster is a local component of the global data distribution. The output of cluster analysis answers two main questions: how many clusters there are in the dataset, and how they are distributed in the entire data space. Since the correct classification of the data vectors is not known a priori, the goal is simply to find, for the given dataset, the best possible cluster patterns extremizing a predefined objective function. Plausible application areas for clustering include, for example, data mining and pattern recognition, among others well known in the art.

[0005] The purpose of vector quantization is to represent a multi-dimensional dataset by a reduced number of codebook vectors that approximate the original dataset with as little information loss as possible for the given range of compression factors. Examples of the use of vector quantization include data compression, and in that context, the same pattern-searching algorithm used in cluster analysis is also applicable in designing a codebook as the concise representation of a large dataset.

[0006] Conventional systems that use clustering and vector quantization typically use a predefined input parameter, such as a model complexity parameter, to determine the number of cluster patterns or the number of codebook vectors to be used in the output of a clustering or vector quantization task, respectively. However, the output model will not be suitable for its intended purpose if this input parameter is sub-optimal. For example, a vector quantization task given a poorly chosen model complexity parameter will not minimize the amount of information loss for a given dataset during a data compression process.

SUMMARY OF AN EMBODIMENT OF THE INVENTION

[0007] A model selection process for selecting a finite mixture model that is suitable for a given task is described. A set of finite mixture models (FMMs) is produced from a set of parameter values and training data for a given task using an Expectation Maximization (EM) process. An FMM from the set of FMMs is selected based on a minimum description length (MDL) value calculated for each of the FMMs.

BRIEF DESCRIPTION OF THE DRAWINGS

[0008] FIG. 1 illustrates one embodiment of a pattern discovery system.

[0009] FIG. 2 illustrates one embodiment of a model selection process flow.

[0010] FIGS. 3A, 3B, and 3C illustrate one embodiment of an expectation maximization process flow.

[0011] FIG. 4 illustrates one embodiment of a model description length selection process flow.

[0012] FIG. 5 illustrates a network environment suitable for practicing the invention.

[0013] FIG. 6 illustrates a computer system suitable for practicing the invention.

DETAILED DESCRIPTION

[0014] In the following detailed description of embodiments of the invention, reference is made to the accompanying drawings in which like references indicate similar elements, and in which is shown, by way of illustration, specific embodiments in which the invention may be practiced. These embodiments are described in sufficient detail to enable those skilled in the art to practice the invention, and it is to be understood that other embodiments may be utilized and that logical, mechanical, electrical, functional and other changes may be made without departing from the scope of the present invention. The following detailed description is, therefore, not to be taken in a limiting sense, and the scope of the present invention is defined only by the appended claims.

[0015] FIG. 1 illustrates a system-level overview of one embodiment of a pattern discovery system 100. The pattern discovery system 100 includes a training module 110, a data storage 120, and an analysis module 130. According to one embodiment, the training module 110 uses a finite Gaussian mixture model and an iterative probabilistic optimization technique, such as an Expectation Maximization (EM) algorithm, for model parameter estimation to generate a model (e.g., a finite mixture model) that is optimized for a specific task (e.g., vector quantization, clustering, etc.). In one embodiment, the EM algorithm uses a set of model-complexity parameter values (e.g., a set of k values) to generate a set of finite mixture models. A finite mixture model (FMM) is a mathematical abstraction of an underlying data-generating process with k component probability distributions. In one embodiment, a penalized likelihood method, based primarily on the Minimum Description Length (MDL) principle, is used to determine a value of k with respect to a dataset D. The determined value of k with respect to the dataset D is used to select the FMM from a set of constructed FMMs, as will be further described below.

[0016] An input data stream 105 to the training module 110 may include one or more data streams interfacing with the pattern discovery system 100. In one embodiment, one or more input ports may interface the one or more data streams with the training module 110. Furthermore, the training module 110 may receive training data from the data storage 120 via a data stream 125.

[0017] The data storage 120 may store the FMMs, the MDL values, and the set of k values, among other data. The data storage 120 may include random access memory, dynamic memory, flash memory, as well as more permanent data storage devices, such as a magnetic storage device, an optical storage device, and other storage devices well known to those of ordinary skill in the art.

[0018] The analysis module 130, in one embodiment, receives one of the FMMs from the data storage 120 and uses live data provided by data stream 135 to output model data based on a type of learning task (e.g., vector quantization, clustering, etc.). In this fashion, the pattern discovery system 100 uses an extended variant of stochastic complexity as an objective function for finding, for example, a number of codebook vectors or clusters. In one embodiment, the model data is output via one or more output ports to output 145.

[0019] In one embodiment of the invention, assume the dataset D = {d₁, d₂, . . . , d_m} is a sample of m independently and identically distributed, real-space datum vectors, where the underlying data space is a subset of R^n. To represent the data-generating process for computational purposes, a finite mixture model M, a mathematical abstraction with k component probability distributions, is introduced, from which the process samples the data. In one embodiment, each component distribution is assumed to be a Gaussian density function, ƒ, with mean vector μ and covariance matrix Σ:

$$f(x \mid \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^{n} |\Sigma|}} \exp\left(-\frac{1}{2}(x - \mu)^{T} \Sigma^{-1} (x - \mu)\right) \qquad (1)$$
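Purely as an illustrative sketch (not part of the original disclosure), the Gaussian density of equation (1) could be evaluated as follows; the function name and the use of NumPy are assumptions:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Multivariate Gaussian density f(x | mu, Sigma) of equation (1)."""
    n = x.shape[0]
    diff = x - mu
    # Mahalanobis term (x - mu)^T Sigma^{-1} (x - mu), computed via a
    # linear solve rather than an explicit matrix inverse.
    maha = diff @ np.linalg.solve(sigma, diff)
    norm = np.sqrt((2 * np.pi) ** n * np.linalg.det(sigma))
    return np.exp(-0.5 * maha) / norm
```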

[0020] In this embodiment, every component ƒ has a specific relative weight, α, associated with it, so the set of component distributions is denoted by the parameter set:

$$\Psi_M = \{\alpha_1, \alpha_2, \ldots, \alpha_k, \mu_1, \mu_2, \ldots, \mu_k, \Sigma_1, \Sigma_2, \ldots, \Sigma_k\}, \qquad (2)$$

[0021] and the process model of the data-generating mechanism can be represented by the probability distribution P(d|Ψ_M), which is a linear combination of component densities:

$$P(d \mid \Psi_M) = \sum_{c=1}^{k} \alpha_c\, f(d \mid \mu_c, \Sigma_c). \qquad (3)$$

[0022] Under the normalization constraint, $0 < \alpha_c < 1$ for $c = 1, \ldots, k$, and $\sum_{c=1}^{k} \alpha_c = 1$.
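Equation (3) with its normalization constraint could then be sketched as below, reusing the hypothetical gaussian_pdf above; this is an illustration, not the disclosed implementation:

```python
def mixture_pdf(d, weights, means, covs):
    """Mixture density P(d | Psi_M) of equation (3): a convex combination
    of the k component Gaussians, with weights alpha_c summing to one."""
    return sum(alpha * gaussian_pdf(d, mu, sigma)
               for alpha, mu, sigma in zip(weights, means, covs))
```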

[0023] This means that every datum vector may belong to multiple components or clusters with different probabilities. Therefore, in the context of vector quantization, a codebook for dataset D will be a set of first moments or mean vectors of the mixture model, {μ₁, μ₂, . . . , μ_k}, whereas in the context of clustering, D will have k patterns in its data distribution, each pattern being represented as a weighted Gaussian density function: {α_i, μ_i, Σ_i}.

[0024] For convenience in data analysis, an indicator vector z_i = [z_{i1}, z_{i2}, . . . , z_{ik}] is introduced for every data vector d_i ∈ D, where z_{ij} = 1 or 0 depending on whether d_i is generated from the j-th distribution component or not. If Z = {z₁, z₂, . . . , z_m} is provided along with D, the learning task is to induce a classifier to predict class labels for new data. If Z is not provided along with D, the learning task is to discover the underlying set of classes from the dataset. The focus of the following description is on the second type of learning task, that is, automatic discovery of cluster components and corresponding distributions in the context of clustering, and automatic discovery of a codebook in the context of vector quantization.

[0025] In one embodiment, the data vectors are all identically distributed and generated independently; therefore, the likelihood of D for the given parameterized model Ψ_M is:

$$P(D \mid \Psi_M) = \prod_{i=1}^{m} P(d_i \mid \Psi_M). \qquad (4)$$

[0026] In a maximum-likelihood setting, the model parameters are estimated by maximizing the objective function, which is the factored likelihood function or its logarithm:

$$\hat{\Psi}_M = \arg\max_{\Psi} P(D \mid \Psi_M) = \arg\max_{\Psi} \log P(D \mid \Psi_M). \qquad (5)$$

[0027] Sometimes the structure of the likelihood function is so complex that it is analytically intractable to estimate the model parameters by straightforward log-likelihood maximization. In that case, the requirement is to simplify the structure of the objective function by introducing a set of suitable latent variables. In cluster identification problems, Z is the intended latent variable set, which, in conjunction with the observed dataset D, forms the complete dataset for likelihood analysis, for which the seed of the objective function is the complete-data likelihood, P(D, Z|Ψ_M).

[0028] FIG. 2 illustrates one embodiment of a model selection process flow 300 used by the pattern discovery system 100 for selecting a model from a set of FMMs. At block 310, the training module 110 obtains an input value. The input value may be specified as a range of values or a percentage deviation from a default value, based on whether the learning task is for vector quantization or clustering, among other examples. At block 320, the training module 110 creates a set of model-complexity values (e.g., a set of k values) based on the input value. For example, in vector quantization, the model-complexity values may be derived from a range of potential compression ratios, and in cluster analysis, the model-complexity values are directly obtained from the pattern-set cardinalities in a range of potential cluster patterns.

[0029] At block 340, the training module 110 performs an EM process for each of the set of values. The EM algorithm is a parameter estimation technique that is applied to estimate the parameters of the finite mixture model. An example of the EM process is further described below and in conjunction with FIGS. 3A, 3B, and 3C.

[0030] At block 350, the training module 110 stores a set of FMMs generated from the EM process of block 340. At block 360, the training module 110 calculates and stores an MDL value for each of the FMMs. At block 380, the training module 110 selects a model from the set of FMMs. In one embodiment, the training module 110 selects the model that is associated with the MDL that has the smallest value, as will be further described in conjunction with FIG. 4 below. Once the model is selected from the set of FMMs, the remaining FMMs may be removed from the data storage 120.
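One way to picture blocks 320 through 380 is the loop sketched below. The helper names run_em and mdl_value are assumptions of this sketch (plausible versions appear with the EM and MDL discussions later in this description), not functions named in the disclosure:

```python
def select_model(data, k_min, k_max):
    """Fit one FMM per model-complexity value k, score each with MDL,
    and keep the model whose MDL value is smallest (FIG. 2)."""
    models, mdl_values = {}, {}
    for k in range(k_min, k_max + 1):        # block 320: the set of k values
        models[k] = run_em(data, k)          # blocks 340-350: fit and store FMM
        mdl_values[k] = mdl_value(data, *models[k])  # block 360: score it
    k_opt = min(mdl_values, key=mdl_values.get)      # block 380: pick minimum
    return models[k_opt]
```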

[0031] It is understood that the model selection process 300 is not limited to the process flow shown in FIG. 2. Rather, alternative process flows may be implemented, such as a process flow in which blocks 340, 350, and 360 loop in an iterative process for each value of k, among other implementation designs well known to those of ordinary skill in the art.

[0032] It is also understood that, in one embodiment, the calculation of the MDL values at block 360 performs hypothesis induction by seeking a model that enables a compact encoding of both parameters and data. For a mixture model M with model complexity K_M (a random variable taking any integral value k ∈ [k_min, k_max]), the marginalized likelihood is:

$$P(D \mid K_M) = \int d\Psi\, P(\Psi_M \mid K_M)\, P(D \mid K_M, \Psi_M), \qquad (6)$$

[0033] where

$$P(D \mid K_M, \Psi_M) = \prod_{i=1}^{m} P(d_i \mid K_M, \Psi_M), \qquad (7)$$

[0034] and P(Ψ_M|K_M) is the parameter prior.

[0035] Following the traditional maximum-likelihood approach, a value for K_M can be obtained by maximizing the marginalized log-likelihood:

$$\hat{K}_M = \arg\max_{k} \log P(D \mid K_M). \qquad (8)$$

[0036] Using the MDL principle, the asymptotic approximation of the marginalized log-likelihood under regularity conditions is:

$$\log P(D \mid K_M) \approx \log P(D \mid K_M, \Psi_M) - \frac{|\Psi_M|}{2} \log m - \frac{1}{2} \sum_{c=1}^{k} |\Psi_c| \log \alpha_c, \qquad (9)$$

[0037] where |Ψ_M| is the total number of parameters required to specify a finite mixture model with K_M (= k) components, and |Ψ_c| is the number of parameters needed to define each component. This marginalized log-likelihood leads to the minimization of the following MDL objective function:

$$MDL(\hat{K}_M) = -2 \log P(D \mid K_M, \Psi_M) + |\Psi_M| \log m + \sum_{c=1}^{k} |\Psi_c| \log \alpha_c. \qquad (10)$$
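A hedged sketch of equation (10) follows; the per-component parameter count assumes full covariance matrices, and mixture_pdf is the earlier sketch, so both are assumptions rather than the disclosed accounting:

```python
import numpy as np

def mdl_value(data, weights, means, covs):
    """MDL objective of equation (10) for a k-component Gaussian mixture
    over m datum vectors in R^n."""
    m, n = data.shape
    k = len(weights)
    log_lik = sum(np.log(mixture_pdf(d, weights, means, covs)) for d in data)
    psi_c = 1 + n + n * (n + 1) // 2   # |Psi_c|: weight, mean, full covariance
    psi_m = k * psi_c                  # |Psi_M|: total parameter count
    return (-2.0 * log_lik
            + psi_m * np.log(m)
            + psi_c * float(np.sum(np.log(weights))))
```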

[0038] The MDL criterion for model selection may be regarded as asymptotically Bayesian.

[0039] After the model has been selected with the model selection process 300, the selected model may be used for analysis as previously described for FIG. 1. For example, upon selecting the model, the analysis module 130 may output the mean vectors {μ₁, μ₂, . . . , μ_k} as the codebook of D if the learning task is for vector quantization. Furthermore, the analysis module 130 may output Ψ_M^{(l)} (weights, mean vectors, covariances) as the final parameter mixture estimate of patterns if the learning task is for clustering.

[0040] Furthermore, it is understood that the invention is not limited to the processing of the model selection process 300 described in FIG. 2. Rather, one of ordinary skill in the art will recognize that in other embodiments, alternative processes may be used. For example, the following illustrates alternative pseudo-code according to one embodiment:

```
Set k ← k_min; l ← 1;
While (k ≤ k_max) {
    Set {α_1, α_2, ..., α_k} ← {α_1^0, α_2^0, ..., α_k^0};
    Set {μ_1, μ_2, ..., μ_k} ← {μ_1^0, μ_2^0, ..., μ_k^0};
    Set {Σ_1, Σ_2, ..., Σ_k} ← {Σ_1^0, Σ_2^0, ..., Σ_k^0};
    Run EM until max{|μ_j^(t+1) − μ_j^(t)|}_{j=1..k} < ε_μ
             and max{|Σ_j^(t+1) − Σ_j^(t)|}_{j=1..k} < ε_Σ;
    Set Ψ_M^(l) ← {α_1, α_2, ..., α_k, μ_1, μ_2, ..., μ_k, Σ_1, Σ_2, ..., Σ_k};
    Set MDL(K̂_M)^(l) ← MDL(k);
    Set k ← k + 1; l ← l + 1;
}
Find k̂ ← arg min_k {MDL(k)}_{k=k_min..k_max};
Find l corresponding to k̂;
If the task is vector quantization {
    Output {μ_1, μ_2, ..., μ_k̂} as the codebook of D;
} else {
    Output Ψ_M^(l) as the final parameter mixture estimate;
}
```

[0041] In addition, it is understood that the invention is not limited to the EM process as described herein. Rather, one of ordinary skill in the art will recognize that other EM algorithms might be modified to perform similar functions as described herein. For example, the EM process for each of the set of values may be performed in an iterative process. In one embodiment, each iteration of the EM process has two steps: an E (Expectation) step and an M (Maximization) step. In the E-step, the expected value of P(D, Z|Ψ_M) is computed with respect to the marginal distribution of Z:

$$Q(\Psi_M, \hat{\Psi}_M^{(i-1)}) = E\left[\log P(D, Z \mid \Psi_M) \mid D, \hat{\Psi}_M^{(i-1)}\right] = \int dZ\, P(Z \mid D, \hat{\Psi}_M^{(i-1)}) \log P(D, Z \mid \Psi_M). \qquad (11)$$

[0042] In the Q function, P(Z|D, Ψ̂_M^{(i−1)}) is the marginal distribution of the latent variables, which is dependent on the observed data and the most recent estimate of the model parameters.

[0043] In the M-step, the Q function is maximized to obtain the new estimate of Ψ_M for the next iteration:

$$\hat{\Psi}_M^{(i)} = \arg\max_{\Psi} Q(\Psi_M, \hat{\Psi}_M^{(i-1)}). \qquad (12)$$

[0044] Ψ̂_M^{(i)}, as computed in the M-step, is applied in the next E-step, and the EM algorithm continues until the predefined termination criterion is met. In each iteration, the log-likelihood increases monotonically, and the algorithm is guaranteed to converge to a local maximum of the likelihood function.
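As a minimal sketch of one such iteration for the Gaussian mixture (responsibilities in the E-step, weighted moments in the M-step), assuming the gaussian_pdf helper above; this illustrates the updates but is not the disclosed code:

```python
import numpy as np

def em_step(data, weights, means, covs):
    """One EM iteration. E-step: posterior class probabilities
    P(C_j | d_i, Psi_M). M-step: re-estimate mixture proportions and
    the first and second moments of every component."""
    m = data.shape[0]
    k = len(weights)
    # E-step: gamma[i, j] is proportional to alpha_j * f(d_i | mu_j, Sigma_j)
    gamma = np.array([[weights[j] * gaussian_pdf(d, means[j], covs[j])
                       for j in range(k)] for d in data])
    gamma /= gamma.sum(axis=1, keepdims=True)
    # M-step: responsibility-weighted moments
    new_weights = gamma.sum(axis=0) / m
    new_means = [gamma[:, j] @ data / gamma[:, j].sum() for j in range(k)]
    new_covs = []
    for j in range(k):
        diff = data - new_means[j]
        new_covs.append((gamma[:, j][:, None] * diff).T @ diff
                        / gamma[:, j].sum())
    return new_weights, new_means, new_covs
```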

[0045] For the incomplete dataset D, the log-likelihood is:

$$L(\Psi_M \mid D) = \log P(D \mid \Psi_M) = \log \prod_{i=1}^{m} P(d_i \mid \Psi_M) = \sum_{i=1}^{m} \log \sum_{c=1}^{k} \alpha_c\, f(d_i \mid \mu_c, \Sigma_c). \qquad (13)$$

[0046] L(Ψ_M|D), in its current functional form, is difficult to optimize. It can be simplified by introducing the latent class-indicator variable set, Z. For every complete datum vector (d_i, z_i), the corresponding likelihood probability under the finite Gaussian mixture model assumption is:

$$P(d_i, z_i \mid \Psi_M) = \prod_{c=1}^{k} \alpha_c^{z_{ic}}\, f(d_i \mid \mu_c, \Sigma_c)^{z_{ic}}. \qquad (14)$$

[0047] Following equation (14), the complete-data log-likelihood is:

$$L(\Psi_M \mid D, Z) = \log P(D, Z \mid \Psi_M) = \log \prod_{i=1}^{m} P(d_i, z_i \mid \Psi_M) = \sum_{i=1}^{m} \sum_{c=1}^{k} z_{ic} \log\left[\alpha_c\, f(d_i \mid \mu_c, \Sigma_c)\right]. \qquad (15)$$
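For concreteness, equation (15) could be computed as sketched below; treating z as either hard 0/1 indicators or soft E-step responsibilities is an assumption of this sketch, as is the reuse of the earlier gaussian_pdf helper:

```python
import numpy as np

def complete_data_log_likelihood(data, z, weights, means, covs):
    """Equation (15): sum over i and c of
    z_ic * log(alpha_c * f(d_i | mu_c, Sigma_c))."""
    total = 0.0
    for i, d in enumerate(data):
        for c in range(len(weights)):
            total += z[i, c] * np.log(weights[c]
                                      * gaussian_pdf(d, means[c], covs[c]))
    return total
```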

[0048] The EM algorithm iterates over this objective function, alternately computing the expected values of Z and Ψ_M.

[0049] FIGS. 3A, 3B, and 3C illustrate one embodiment of an EM process flow 400 for performing the EM process described in conjunction with block 340 above. Assuming the k density components of the mixture model are denoted as C₁, C₂, . . . , C_k, the EM process flow 400 proceeds in the following way. At block 403, the training module 110 sets a value i to one. At block 406, the training module 110 receives the i-th data point. At block 409, the training module 110 sets a j value to one. At block 412, the training module 110 receives the j-th pattern handle. At block 415, the training module 110 computes the likelihood of the i-th data vector for the j-th pattern. In this fashion, the class-conditional likelihood probability of every i-th datum vector for every j-th latent class, P(d_i|C_j, Ψ_M), is computed. At block 418, the training module 110 increments the j value by one. At block 421, the training module 110 determines whether the current j value exceeds the number of patterns. If the training module 110 determines the current j value exceeds the number of patterns, control passes to block 424. If the training module 110 determines the current j value does not exceed the number of patterns, control returns to block 412 and the process flow continues as described above.

[0050] At block 424, the training module 110 increments the i value by one. At block 427, the training module 110 determines whether the i value exceeds the number of data points. If the training module 110 determines the i value exceeds the number of data points, control passes to block 430. If the training module 110 determines the i value does not exceed the number of data points, control returns to block 406 and the process flow continues as described above.

[0051] At block 430, the training module 110 sets the i value to one. At block 433, the training module 110 receives the i-th data point. At block 436, the training module 110 sets the j value to one. At block 439, the training module 110 receives the j-th pattern handle. At block 442, the training module 110 computes the posterior class probability of the j-th pattern for the i-th data vector. In this fashion, the posterior class probability of every j-th latent class with respect to every i-th datum vector, P(C_j|d_i, Ψ_M), is computed. At block 445, the training module 110 increments the j value by one. At block 448, the training module 110 determines whether j exceeds the number of patterns. If the training module 110 determines the j value exceeds the number of patterns, control passes to block 451. If the training module 110 determines the j value does not exceed the number of patterns, control returns to block 439 and the process flow continues as described above.

[0052] At block 451, the training module 110 increments the i value by one. At block 454, the training module 110 determines whether the i value exceeds the number of data points. If the training module 110 determines the i value exceeds the number of data points, control passes to block 457. If the training module 110 determines the i value does not exceed the number of data points, control returns to block 433 and the process flow continues as described above.

[0053] At block 457, the training module 110 sets the j value to one. At block 460, the training module 110 receives the j-th pattern handle. At block 463, the training module 110 computes the mixture proportion of the j-th class. At block 466, the training module 110 computes a first moment of the j-th class. At block 469, the training module 110 computes a second moment of the j-th class. Therefore, based on the statistics provided by the probability distributions at each EM iteration, the parameters of all the mixture components, {α_j, μ_j, Σ_j}_{j=1}^{k}, are computed.

[0054] At block 472, the training module 110 increments the j value by one. At block 475, the training module 110 determines whether the j value exceeds the k value. If the training module 110 determines the j value exceeds the k value, control passes to block 478. If the training module 110 determines the j value does not exceed the k value, control returns to block 460 and the process flow continues as described above.

[0055] At block 478, the training module 110 determines whether a termination criterion is met. If the training module 110 determines the termination criterion is met, control passes to block 481 and the process flow 400 ends. If the training module 110 determines the termination criterion is not met, control returns to block 430 and the process flow continues as described above. In one embodiment, the termination criterion of every EM run will be based on some predefined error margin on the first and second moments of the Gaussian densities:

$$\max\left\{\left|\mu_j^{(t+1)} - \mu_j^{(t)}\right|\right\}_{j=1}^{k} < \varepsilon_\mu \quad \text{and} \quad \max\left\{\left|\Sigma_j^{(t+1)} - \Sigma_j^{(t)}\right|\right\}_{j=1}^{k} < \varepsilon_\Sigma \qquad (16)$$
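The termination test of equation (16) might look as follows; the element-wise absolute difference used to compare moments is an assumption, since the disclosure does not fix a particular norm:

```python
import numpy as np

def converged(old_means, new_means, old_covs, new_covs,
              eps_mu=1e-4, eps_sigma=1e-4):
    """True when the largest change in any mean or covariance entry falls
    below the predefined error margins eps_mu and eps_sigma."""
    mu_shift = max(np.max(np.abs(np.asarray(new) - np.asarray(old)))
                   for old, new in zip(old_means, new_means))
    sigma_shift = max(np.max(np.abs(np.asarray(new) - np.asarray(old)))
                      for old, new in zip(old_covs, new_covs))
    return mu_shift < eps_mu and sigma_shift < eps_sigma
```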

[0056] In one embodiment, the following pseudo-code illustrates the complete EM optimization for a given model complexity K̂_M = k:

```
Set t ← 0;
Repeat {
    for (i = 1 to m) {
        for (j = 1 to k) {
            P(d_i | C_j, Ψ_M)^(t+1) = f(d_i | μ̂_j^(t), Σ̂_j^(t));
        }
    }
    for (i = 1 to m) {
        for (j = 1 to k) {
            P(C_j | d_i, Ψ_M)^(t+1) = α_j^(t) P(d_i | C_j, Ψ_M)^(t+1)
                / Σ_{c=1..k} α_c^(t) P(d_i | C_c, Ψ_M)^(t+1);
        }
    }
    for (j = 1 to k) {
        α_j^(t+1) = (1/m) Σ_{i=1..m} P(C_j | d_i, Ψ_M)^(t+1);
        μ_j^(t+1) = (1/(m α_j^(t+1))) Σ_{i=1..m} P(C_j | d_i, Ψ_M)^(t+1) d_i;
        Σ_j^(t+1) = (1/(m α_j^(t+1))) Σ_{i=1..m} P(C_j | d_i, Ψ_M)^(t+1)
                    (d_i − μ_j^(t+1))(d_i − μ_j^(t+1))^T;
    }
    Set t ← t + 1;
} Until max{|μ_j^(t+1) − μ_j^(t)|}_{j=1..k} < ε_μ
    and max{|Σ_j^(t+1) − Σ_j^(t)|}_{j=1..k} < ε_Σ;
```

[0057] FIG. 4 illustrates one embodiment of an MDL selection process flow 500 for selecting a model from the set of FMMs. At block 505, the training module 110 sets the MDL_OPT value to an arbitrarily high value. At block 510, the training module 110 sets an i value to the minimum k value. At block 520, the training module 110 receives the MDL value associated with the current i value. At block 530, the training module 110 determines whether the MDL value is less than the MDL_OPT value. If the training module 110 determines the MDL value is less than the MDL_OPT value, control passes to block 540. If the training module 110 determines the MDL value is not less than the MDL_OPT value, control passes to block 550.

[0058] At block 540, the training module 110 sets the MDL_OPT value to the MDL value associated with the current i value. The training module 110 also sets a K_OPT value to the current i value, where MDL(K_OPT) is the smallest MDL value.

[0059] At block 550, the training module 110 increments the i value by one. At block 560, the training module 110 determines whether the i value is less than or equal to the maximum value of k. If the training module 110 determines the i value is less than or equal to the maximum value of k, control returns to block 520. If the training module 110 determines the i value is not less than or equal to the maximum value of k, control passes to block 565.

[0060] At block 565, the training module 110 selects FMM(K_OPT) from the set of FMMs. In this fashion, the training module 110 selects the FMM from the set of FMMs whose model complexity equals K_OPT.
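The scan of blocks 505 through 565 reduces to a running-minimum search; in this sketch the stored MDL values and FMMs are assumed to be keyed by their k values:

```python
def select_fmm(mdl_values, fmms):
    """Return the FMM whose model complexity K_OPT has the smallest
    MDL value over k in [k_min, k_max] (FIG. 4)."""
    mdl_opt, k_opt = float("inf"), None    # block 505
    for k, mdl in mdl_values.items():      # blocks 510-560
        if mdl < mdl_opt:                  # block 530
            mdl_opt, k_opt = mdl, k        # block 540
    return fmms[k_opt]                     # block 565
```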

[0061] At block 570, the training module 110 determines whether the task is a vector quantization task. If the training module 110 determines the task is a vector quantization task, control passes to block 580. If the training module 110 determines the task is not a vector quantization task but a clustering task, control passes to block 590.

[0062] At block 580, the training module 110 outputs the mean vectors as the codebook of D. At block 590, the training module 110 outputs the full distribution model as the final parameter mixture estimate, which will be interpreted as the set of discovered cluster patterns.

[0063] In one embodiment, as shown in FIG. 5, a computer 601 is part of, or coupled to, a network 605, such as the Internet, to exchange data with another computer 603, as either a client or a server computer. Typically, a computer couples to the Internet through an ISP (Internet Service Provider) 607 and executes a conventional Internet browsing application to exchange data with a server. Other types of applications allow clients to exchange data through the network 605 without using a server. It is readily apparent that the present invention is not limited to use with the Internet; directly coupled and private networks are also contemplated.

[0064] One embodiment of a system 740 suitable for use in the environments of FIG. 5 is illustrated in FIG. 6. The system 740 includes a processor 750, memory 755, and input/output capability 760 coupled to a system bus 765. The memory 755 is configured to store instructions which, when executed by the processor 750, perform the methods described herein. The memory 755 may also store data such as the set of FMMs and the set of values. Input/output 760 provides for the delivery and display of the data or portions or representations thereof. Input/output 760 also encompasses various types of machine- or computer-readable media, including any type of storage device that is accessible by the processor 750. One of skill in the art will immediately recognize that the term "computer-readable medium/media" or "machine-readable medium/media" further encompasses a carrier wave that encodes a data signal. It will also be appreciated that the computer is controlled by operating system software executing in memory 755. Input/output and related media 760 store the machine/computer-executable instructions for the operating system and methods of the present invention as well as the set of FMMs and the set of values.

[0065] The description of FIGS. 5 and 6 is intended to provide an overview of computer hardware and various operating environments suitable for implementing the invention, but is not intended to limit the applicable environments. It will be appreciated that the system 740 is one example of many possible devices that have different architectures. A typical device will usually include at least a processor, memory, and a bus coupling the memory to the processor. Such a configuration encompasses personal computer systems, network computers, television-based systems, such as Web TVs or set-top boxes, handheld devices, such as cell phones and personal digital assistants, and similar devices. One of skill in the art will immediately appreciate that the invention can be practiced with other system configurations, including multiprocessor systems, minicomputers, mainframe computers, and the like. The invention can also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network.

[0066] It will be appreciated that more or fewer processes may be incorporated into the methods illustrated in FIGS. 2, 3A, 3B, 3C, and 4 without departing from the scope of the invention, and that no particular order is implied by the arrangement of blocks shown and described herein. Describing the methods by reference to a flow diagram enables one skilled in the art to develop programs including instructions to carry out the methods on suitably configured computers (the processor of the computer executing the instructions from computer-readable media, including memory). The computer-executable instructions may be written in a computer programming language or may be embodied in firmware logic. If written in a programming language conforming to a recognized standard, such instructions can be executed on a variety of hardware platforms and interfaced to a variety of operating systems. In addition, the present invention is not described with reference to any particular programming language. It will be appreciated that a variety of programming languages may be used to implement the teachings of the invention as described herein. Furthermore, it is common in the art to speak of software, in one form or another (e.g., program, procedure, process, application, module, logic, etc.), as taking an action or causing a result. Such expressions are merely a shorthand way of saying that execution of the software by a computer causes the processor of the computer to perform an action or produce a result.

[0067] A model selection process has been described to select a suitable model from a finite set of FMMs. Furthermore, a dynamic model-complexity parameter, which is not predefined but is determined through a learning process, has also been described as the basis for selecting the model. It is understood that the model selected by the pattern discovery system 100 may be used with data in fields such as financial investment, data mining, pattern recognition (e.g., voice recognition, handwriting recognition, etc.), texture and image segmentation, boundary detection and surface approximation, magnetic resonance imaging, handwritten character recognition, computer vision, and information retrieval, among other applications well known to those of ordinary skill in the art.

[0068] Although specific embodiments have been illustrated and described herein, it will be appreciated by those of ordinary skill in the art that any arrangement which is calculated to achieve the same purpose may be substituted for the specific embodiments shown. This application is intended to cover any adaptations or variations of the present invention.

What is claimed is:
1. A computerized method comprising: producing a set of finite mixture models (FMMs) from a set of parameter values and training data using an Expectation Maximization (EM) process; calculating a minimum description length (MDL) value for each of the set of FMMs; and selecting an FMM based on the corresponding MDL values.
2. The method of claim 1, wherein each parameter value is a model-complexity parameter for the EM process.
3. The method of claim 1, wherein selecting comprises selecting the FMM from the set of FMMs corresponding to the MDL having a smallest MDL value.
4. The method of claim 1, further comprising applying the FMM from the set of FMMs to vector quantize a stream of data.
5. The method of claim 4, wherein the FMM of the set of FMMs defines a vector quantization codebook.
6. The method of claim 1, further comprising applying the FMM from the set of FMMs to cluster a stream of data.
7. The method of claim 6, wherein the FMM from the set of FMMs defines a cluster pattern.
8. A machine-readable medium having executable instructions to cause a device to perform a method comprising: producing a set of finite mixture models (FMMs) from a set of parameter values and training data using an Expectation Maximization (EM) process; calculating a minimum description length (MDL) value for each of the set of FMMs; and selecting an FMM based on the corresponding MDL values.
9. The machine-readable medium of claim 8, wherein each parameter value is a model-complexity parameter for the EM process.
10. The machine-readable medium of claim 8, wherein selecting comprises selecting the FMM from the set of FMMs corresponding to the MDL having a smallest MDL value.
11. The machine-readable medium of claim 8, further comprising applying the FMM from the set of FMMs to vector-quantize a stream of data.
12. The machine-readable medium of claim 11, wherein the FMM from the set of FMMs defines a vector quantization codebook.
13. The machine-readable medium of claim 8, further comprising applying the FMM from the set of FMMs to cluster a stream of data.
14. The machine-readable medium of claim 13, wherein the FMM from the set of FMMs defines a cluster pattern.
15. A system comprising: a processor coupled to a memory through a bus; and a model selection process executed by the processor from the memory to cause the processor to produce a set of finite mixture models (FMMs) from a set of parameter values and training data using an Expectation Maximization (EM) process, to calculate a minimum description length (MDL) value for each of the set of FMMs, and to select an FMM based on the corresponding MDL values.
16. The system of claim 15, wherein each parameter value is a model-complexity parameter for the EM process.
17. The system of claim 15, wherein the model selection process further causes the processor, when selecting, to select the FMM from the set of FMMs corresponding to the MDL having a smallest value.
18. The system of claim 15, wherein the model selection process further causes the processor to apply the FMM from the set of FMMs to vector quantize a stream of data.
19. The system of claim 18, wherein the FMM from the set of FMMs defines a vector quantization codebook.
20. The system of claim 15, wherein the model selection process further causes the processor to apply the FMM from the set of FMMs to cluster a stream of data.
21. The system of claim 20, wherein the FMM from the set of FMMs defines a cluster pattern.
22. An apparatus comprising: means for producing a set of finite mixture models (FMMs) from a set of parameter values and training data using an Expectation Maximization (EM) process; means for calculating a minimum description length (MDL) value for each of the set of FMMs; and means for selecting an FMM based on the corresponding MDL values.
23. The apparatus of claim 22, wherein each parameter value is a model-complexity parameter for the EM process.
24. The apparatus of claim 22, wherein the means for selecting comprises means for selecting the FMM from the set of FMMs corresponding to the MDL having a smallest MDL value.
25. The apparatus of claim 22, further comprising means for applying the FMM from the set of FMMs to vector-quantize a stream of data.
26. The apparatus of claim 25, wherein the FMM from the set of FMMs defines a vector quantization codebook.
27. The apparatus of claim 22, further comprising means for applying the FMM from the set of FMMs to cluster a stream of data.
28. The apparatus of claim 27, wherein the FMM from the set of FMMs defines a cluster pattern.